The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

Authors

  • Chakkrit Tantithamthavorn
  • Ahmed E. Hassan
  • Kenichi Matsumoto
Abstract

Defect prediction models trained on class-imbalanced datasets (i.e., datasets in which defective and clean modules are not equally represented) are highly susceptible to producing inaccurate predictions. Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models, but arrives at contradictory conclusions because the studies differ in their choice of datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines on whether class rebalancing techniques should be applied in the context of defect prediction models. In this paper, we investigate the impact of 4 popularly used class rebalancing techniques on 10 commonly used performance measures and on the interpretation of defect prediction models. We also construct statistical models to better understand in which experimental design settings class rebalancing techniques are beneficial for defect prediction models. Through a case study of 101 datasets that span proprietary and open-source systems, we recommend that class rebalancing techniques be used when quality assurance teams wish to increase the completeness of identifying software defects (i.e., Recall). However, class rebalancing techniques should be avoided when interpreting defect prediction models. We also find that class rebalancing techniques do not impact the AUC measure. Hence, AUC should be used as a standard measure when comparing defect prediction models.
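To make the terms in the abstract concrete, the following is a minimal sketch of one possible class rebalancing technique (random over-sampling) together with the two measures the abstract names, Recall and AUC. The abstract does not list the 4 techniques studied, so random over-sampling is an assumption chosen purely for illustration, and the function names (random_oversample, recall, auc) are hypothetical rather than taken from the paper.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same size. Illustrative only; not the paper's procedure."""
    rng = random.Random(seed)
    defective = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((defective, clean), key=len)
    if not minority:
        raise ValueError("both classes must be present")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def recall(y_true, y_pred):
    """Fraction of truly defective modules (label 1) predicted as defective."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def auc(y_true, scores):
    """Probability that a random defective module is scored above a random
    clean one, counting ties as one half (pairwise definition of AUC)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a typical setup, only the training data would be rebalanced, and Recall and AUC would then be computed on untouched test data. Recall depends on the threshold that turns scores into predictions, whereas AUC depends only on how the modules are ranked.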


Journal:
  • CoRR

Volume: abs/1801.10269

Publication date: 2018